Distribution-Preserving Statistical Disclosure Limitation
Abstract
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences based on the partially synthetic data, because the imputation model determines the distribution of synthetic values. We present a practical method to generate synthetic values when the imputer has only limited information about the true data generating process. We combine a simple imputation model (such as regression) with density-based transformations that preserve the distribution of the confidential data, up to sampling error, on specified subdomains. We demonstrate through simulations and a large-scale application that our approach preserves important statistical properties of the confidential data, including higher moments, with low disclosure risk.
Keywords: statistical disclosure limitation, confidentiality, privacy, multiple imputation, partially synthetic data
Note to Editor: Appendicized Figures are included for reference only. They are not intended for publication.
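The idea of pairing a simple imputation model with a distribution-preserving transformation can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract describes density-based transformations on specified subdomains, while this sketch uses a plain rank-based quantile mapping onto the interpolated empirical quantiles of the confidential values, over the whole sample. The function names (`fit_line`, `quantile`, `synthesize`, `skewness`) and the simulated data are hypothetical.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r * r for r in resid) / (len(x) - 2)) ** 0.5
    return a, b, sd


def quantile(sorted_vals, u):
    """Linearly interpolated quantile of sorted_vals at probability u in [0, 1]."""
    pos = u * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def synthesize(x, y, rng):
    """Draw from a (possibly mis-specified) regression, then remap the draws."""
    a, b, sd = fit_line(x, y)
    # Step 1: simple imputation model, here a linear regression with normal noise.
    draws = [a + b * xi + rng.gauss(0.0, sd) for xi in x]
    # Step 2: keep each draw's rank, but take its value from the confidential
    # distribution (assumption: empirical quantiles stand in for a density estimate).
    n = len(draws)
    order = sorted(range(n), key=draws.__getitem__)
    target = sorted(y)
    synthetic = [0.0] * n
    for rank, idx in enumerate(order):
        synthetic[idx] = quantile(target, (rank + 0.5) / n)
    return draws, synthetic


def skewness(values):
    """Population skewness (third standardized moment)."""
    m, s = statistics.fmean(values), statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Simulated right-skewed "confidential" variable; a linear model is mis-specified.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(2000)]
y = [math.exp(0.5 * xi + rng.gauss(0.0, 0.5)) for xi in x]

draws, synthetic = synthesize(x, y, rng)
print("skewness, confidential:", round(skewness(y), 2))
print("skewness, raw regression draws:", round(skewness(draws), 2))
print("skewness, transformed draws:", round(skewness(synthetic), 2))
```

In this toy setup the raw regression draws are roughly symmetric, whereas the transformed draws inherit the skewness of the confidential values. A mapping this close to the empirical distribution would need its own disclosure-risk assessment, which the sketch does not attempt.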
Similar resources
Distribution-preserving statistical disclosure limitation
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. A mis-specified imputation model can invalidate inferences because the distribution of synthetic data is completely determined by the model used to generate th...
Estimation of Anonymous Email Network Characteristics through Statistical Disclosure Attacks
Social network analysis aims to obtain relational data from social systems to identify leaders, roles, and communities in order to model profiles or predict a specific behavior in users' network. Preserving anonymity in social networks is a subject of major concern. Anonymity can be compromised by disclosing senders' or receivers' identity, message content, or sender-receiver relationships. Und...
Statistical Disclosure Control for Data Privacy Preservation
With the phenomenal change in the way data are collected, stored, and disseminated among data analysts, there is an urgent need to protect the privacy of data. When individual data are disseminated among various users, there is a high risk of revealing sensitive data about individuals, which may raise legal and ethical issues. Statistical Disclosure Control (SDC) ...
Privacy-Preserving Data Mining
Privacy-preserving data mining (PPDM) refers to the area of data mining that seeks to safeguard sensitive information from unsolicited or unsanctioned disclosure. Most traditional data mining techniques analyze and model the data set statistically, in aggregation, while privacy preservation is primarily concerned with protecting against disclosure of individual data records. This domain separation...